Indoor scene understanding is central to applications such as robot navigation and human companion assistance. In recent years, data-driven deep neural networks have outperformed many traditional approaches thanks to their representation learning capabilities. One of the bottlenecks in training for better representations is the amount of available per-pixel ground truth data required for core scene understanding tasks such as semantic segmentation, normal prediction, and object edge detection. To address this problem, a number of works have proposed using synthetic data. However, a systematic study of how such synthetic data is generated has been missing. In this work, we introduce a large-scale synthetic dataset with 400K physically-based rendered images from 45K realistic 3D indoor scenes. We study the effects of rendering methods and scene lighting on training for three computer vision tasks: surface normal prediction, semantic segmentation, and object boundary detection. This study provides insights into best practices for training with synthetic data (more realistic rendering is worth it) and shows that pretraining with our new synthetic dataset can improve results beyond the current state of the art on all three tasks.